%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '40px'}}}%%
flowchart TD
A[Data types] --> B[Categorical]
B --> C(Nominal)
B --> D(Ordinal)
A --> E(Numerical)
E --> F(Discrete)
E --> G(Continuous)
Don’t work on your master file
First steps - Checks
Nominal
Ordinal
Discrete
Continuous
Nominal
[study group, control group] [male, female]
Ordinal
[strongly disagree, disagree, agree, strongly agree] [grade 1, grade 2]
Discrete
[1, 2, 3, 4, ...]
Continuous
[173.1, 181.3, 193.0, ...]
When we report on data we want to provide information that summarises it, without having to provide the entire dataset itself.
Common things to report:
i.e. measures of central tendency
Mean
“average”
sum of all values / the number of values
[1, 6, 2, 7, 4, 7, 3]
(1+6+2+7+4+7+3) / 7 = 4.286
Median
“middle number”
[1, 2, 3, 4 , 6, 7, 7] = 4
[4, 5, 7 | 8, 10, 12] = 7.5
Mode
“most common number”
[1, 6, 2, 7, 4, 7, 3]
= 7
age = [18, 19, 20, 18, 21, 22, 23, 65]
mean(age) = 25.75
median(age) = 20.5
mode(age) = 18
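The three measures for the age example can be reproduced with Python's standard `statistics` module. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

age = [18, 19, 20, 18, 21, 22, 23, 65]

mean_age = statistics.mean(age)      # sum of all values / number of values
median_age = statistics.median(age)  # middle of the sorted values (average of the two middle ones here)
mode_age = statistics.mode(age)      # most common value

print(mean_age, median_age, mode_age)
```

The single value of 65 pulls the mean (25.75) well above the median (20.5), which is why it helps to report more than one measure.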
Quartiles
[14,15,16,17, 18,19,20,21, 22,25,27,28, 30,32,34,40,55]
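As a rough sketch, the quartiles of this list can be computed with `statistics.quantiles`. The choice of `method="inclusive"` is an assumption made for this example; tools use slightly different quartile conventions, so other software may give slightly different cut points.

```python
import statistics

values = [14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 27, 28, 30, 32, 34, 40, 55]

# n=4 splits the sorted data into four parts and returns the three cut points.
# method="inclusive" is an assumption for this sketch, not the only convention.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(q1, q2, q3)
```

Here `q2` is the median, and `q1` and `q3` mark the lower and upper quarters of the data.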
Standard deviation
[2,5,3,4,7,1,8,10]
mean = 5
| Value | Deviation | Deviation² |
|---|---|---|
| 2 | -3 | 9 |
| 5 | 0 | 0 |
| 3 | -2 | 4 |
| 4 | -1 | 1 |
| 7 | 2 | 4 |
| 1 | -4 | 16 |
| 8 | 3 | 9 |
| 10 | 5 | 25 |
| Total | 0 | 68 |
Variance = sum(Deviation^2) / (n - 1)
Variance = 68 / 7
Variance = 9.714
SD = sqrt(Variance)
SD = 3.117
Summary:
mean 5 (SD 3.117)
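The same steps (deviations, squared deviations, variance, square root) can be followed in a few lines of Python. This is a sketch of the hand calculation above, with a cross-check against the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

values = [2, 5, 3, 4, 7, 1, 8, 10]
n = len(values)

mean = sum(values) / n                   # 5.0
deviations = [v - mean for v in values]  # distance of each value from the mean
squared = [d ** 2 for d in deviations]   # squaring removes the signs
variance = sum(squared) / (n - 1)        # sample variance: 68 / 7
sd = math.sqrt(variance)                 # back to the original units

print(f"mean {mean:g} (SD {sd:.3f})")

# Cross-check against the library implementation
assert math.isclose(sd, statistics.stdev(values))
```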
Minimum and maximum
Range
Both can be influenced by outliers and extreme values
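To illustrate how much a single extreme value can matter, the sketch below reuses the age list from earlier and compares the range with and without the value 65. Dropping the value is done here purely for illustration, not as a recommended way to handle outliers.

```python
age = [18, 19, 20, 18, 21, 22, 23, 65]

age_range = max(age) - min(age)  # 65 - 18

# Illustration only: recompute the range as if the extreme value were absent
without_extreme = [a for a in age if a != 65]
range_without_extreme = max(without_extreme) - min(without_extreme)  # 23 - 18

print(age_range, range_without_extreme)
```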
| Measure | Excel function |
|---|---|
| Mean | AVERAGE() |
| Median | MEDIAN() |
| Mode | MODE() |
| Q1 | QUARTILE(…, 1) |
| Q3 | QUARTILE(…, 3) |
| Standard Deviation | STDEV() |
| Minimum | MIN() |
| Maximum | MAX() |
Standard error (SE): how far away the sample mean is likely to be from the true population mean
Standard Deviation divided by the square root of the sample size
\[SE = \frac{SD}{\sqrt{n}}\]
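As a small worked example, the standard error for the eight values used in the standard deviation calculation can be computed like this. It is a sketch that reuses that dataset; the variable names are illustrative.

```python
import math
import statistics

values = [2, 5, 3, 4, 7, 1, 8, 10]

sd = statistics.stdev(values)     # sample SD, about 3.117
se = sd / math.sqrt(len(values))  # SD divided by the square root of n

print(round(se, 3))
```

With n = 8 this gives an SE of about 1.102, noticeably smaller than the SD, and it would shrink further as the sample size grows.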
“If we repeated the same measure again in future, 95% of the time the 95% confidence interval will cover the true population parameter”
“If we repeated the same calculation again in future on different samples, 95% of the time the 95% confidence interval will cover the true population parameter”
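The "repeated samples" idea can be explored with a small simulation. Everything in this sketch is an assumption made for illustration: a made-up normally distributed population (mean 170, SD 10), a sample size of 50, and a simple normal-approximation interval (mean ± 1.96 × SE) rather than a t-based one.

```python
import math
import random
import statistics

rng = random.Random(42)            # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 170.0, 10.0   # hypothetical population values
N, REPEATS = 50, 2000              # sample size and number of repeated samples
Z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96

covered = 0
for _ in range(REPEATS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    lower, upper = m - Z * se, m + Z * se
    if lower <= TRUE_MEAN <= upper:  # did this interval cover the true mean?
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of intervals covered the true mean")
```

Each individual interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across many repeated samples.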
Common misunderstandings of CIs:
Free and open source statistical software
Designed to be easy to use, and with an interface similar to other statistical programs
Uses:
A plot should be easy to understand in just a few seconds…
Some simple advice from Murrell, 2013:
| | cyl | mpg |
|---|---|---|
| Mazda RX4 | 6 | 21.0 |
| Mazda RX4 Wag | 6 | 21.0 |
| Datsun 710 | 4 | 22.8 |
| Hornet 4 Drive | 6 | 21.4 |
| Hornet Sportabout | 8 | 18.7 |
| Valiant | 6 | 18.1 |
| Duster 360 | 8 | 14.3 |
| Merc 240D | 4 | 24.4 |
| Merc 230 | 4 | 22.8 |
| Merc 280 | 6 | 19.2 |
My suggestion:
The art and science of using data from a sample to make inferences on a population, based on certain characteristics from the data
A common approach to this is called frequentist inference - which frames an analysis in terms of how likely (or unlikely) a result is to occur.
| Dependent variable | Independent variable |
|---|---|
| Outcome variable | Explanatory variable |
| Response variable | Predictor variable |
| y-variable | x-variable |
Blood pressure ← Diabetes Status
Jumping height ~ Gender
Difference in proportions
| Gender | Diabetes: No | Diabetes: Yes | Total |
|---|---|---|---|
| Female | 18 | 8 | 26 |
| Male | 14 | 10 | 24 |
| Total | 32 | 18 | 50 |
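As a minimal sketch, the proportion with diabetes in each group, and the difference between them, can be computed directly from the counts in the table. The dictionary layout is just one convenient way to hold the numbers, and no statistical test is performed here.

```python
# Counts taken from the table above
counts = {
    "female": {"no": 18, "yes": 8},
    "male": {"no": 14, "yes": 10},
}

# Proportion with diabetes within each gender
proportions = {
    group: c["yes"] / (c["yes"] + c["no"])
    for group, c in counts.items()
}
difference = proportions["male"] - proportions["female"]

for group, p in proportions.items():
    print(f"{group}: {p:.1%}")
print(f"difference: {difference:.1%}")
```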
Difference in means
| Diabetes | Cholesterol, mean (SD) |
|---|---|
| No | 176 (15.4) |
| Yes | 212 (17.4) |
Difference in many means
Correlation
A crude measure of how related 2 variables are
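One common version of this measure is Pearson's correlation coefficient, sketched below by hand. Using Pearson's r is an assumption of this example (the slides do not name a specific coefficient), and the data are the `cyl` and `mpg` values for the ten cars in the table shown earlier.

```python
import math

# cyl and mpg for the ten cars in the earlier table
cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(cyl, mpg)
print(round(r, 2))
```

The value is negative: in these ten rows, cars with more cylinders tend to have lower mpg. A value near -1 or +1 indicates a strong linear relationship, and a value near 0 a weak one.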
Correlation - be careful!
Regression Modelling
A more complex approach to looking at relationships between variables
Model the effect of different explanatory variables on an outcome variable
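The simplest case, one explanatory variable and one outcome, can be sketched as a least-squares straight line. This example is an illustration only: it models `mpg` from `cyl` for the ten cars in the earlier table, and the function name `fit_line` is made up for this sketch. Real analyses typically use statistical software and more than one explanatory variable.

```python
# Outcome (mpg) and explanatory variable (cyl) for the ten cars in the earlier table
cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(cyl, mpg)
print(f"mpg = {intercept:.2f} + ({slope:.2f}) * cyl")
```

The slope is the estimated change in the outcome for a one-unit increase in the explanatory variable; here, each additional cylinder is associated with roughly 1.7 fewer miles per gallon in this small sample.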
Allows for
OpenIntro Statistics - FREE! - https://www.openintro.org/book/os/
Andy Field et al. - Discovering Statistics Using SPSS / Discovering Statistics Using R
La Trobe Graduate Research School RED Courses - https://www.latrobe.edu.au/researchers/grs/red/workshops-seminars
Google / ChatGPT are good helpers
For more complicated data work, or if you are considering a research career, consider learning a scripting language such as R or Python
Any questions?